Abstract



In this paper, we present a method of pitch tracking based on variance minimization of locally
periodic subsamples of an acoustic signal. Replicates are taken along the length of the
periodically sampled data of the signal vector, and the locally averaged sample variances
are minimized to estimate the fundamental frequency. Using this method, pitch
tracking of any text-independent voiced signal is possible for different speakers.



Extraction of the fundamental frequency (or pitch) of a speech signal is a basic
problem in both speech processing and speaker recognition. The typical pitch range is 80-200 Hz for
male speakers and 150-350 Hz for female speakers. Many methods to extract the pitch of speech signals
have been proposed, but improvements in their accuracy and in their robustness against noise are
still desired. On the whole, no available method for pitch extraction is both reliable and accurate.
Measuring the period of a speech waveform, which varies with the detailed structure of the waveform,
can also be quite difficult. Another problem is the automatic selection of the window of the voiced speech segments.

The autocorrelation method [Rabiner, 10] and the average magnitude difference function (AMDF) method
[Ross et al., 15] are the most basic standard methods for finding pitch. Based on these
two methods, several refinements have been proposed, such as auditory modeling [Cosi et al., 3],
probabilistic AMDF modeling [Jamieson et al., 5], a real-time digital hardware pitch detector
[Rabiner et al., 12], a semiautomatic pitch detector (SAPD) [Rabiner et al., 13], automatic formant
analysis [Rabiner et al., 14], weighted autocorrelation [Shimamura et al., 16], modified
autocorrelation and AMDF [Tan et al., 17], the projection measure technique [Yuo et al., 18],
pseudo-pitch synchronous analysis [Zilca et al., 19] and many more [Rabiner et al., 11]. Other ideas
on pitch extraction are discussed in [Marchand, 9] and in the tutorials [Gerhard, 4] and [Campbell, 6].

For an idealized speech signal in a stationary noisy environment the following mathematical abstraction 
has been consistently assumed in this paper. 



where $\{z_j\}$ is an uncorrelated sequence of random variables (white noise) and $\{c_j\}$ are the
coefficients of the discrete spectrum of the stationary noise. If the main signal $\{y_n\}$ possesses a pure
fundamental frequency (or pitch), which is again an idealized view, it is assumed in this paper that
$f(j) = j f_p$ for a suitable pitch value $f_p$. The actual situation becomes more complicated if the
idealized signal contains multiple pitch streams or is convolved with a channel filter. Therefore, from a
practical point of view, the estimation of the pitch of a signal is essentially a statistical problem. Here we
propose a new method for extracting the fundamental frequency of a speech signal in clean and, to some
extent, noisy environments using simple statistical techniques. In motivation, the technique is similar to
zero-crossing based techniques ([Kedem, 8] and [Gerhard, 4]); however, the theory is much simpler and
statistical in nature. The main novelty of our approach lies in formulating the problem in this
manner and in putting standard measures of pitch (such as autocorrelation and AMDF) in the same
framework as our proposed measure, the Average Squared Mean Difference Function (ASMDF).
This gives a more comprehensive approach that is left open for further refinement.

The remainder of this paper is organized as follows. Section II describes the motivation and the
principle of the proposed method. In Section III, we show the results of preliminary tests of the proposed
method and confirm its effectiveness by comparing it with some standard pitch detection methods.
In Section IV, we give an error analysis of the method. In Section V, we conclude the paper
with views on further development. In the appendix, given as Section VI, we present the important
calculations that show the link between the proposed method and the autocorrelation function.
II. PROPOSED METHOD

Given a discrete time signal $y = (y_1, y_2, \ldots, y_n)$ (which is the real part of the complex idealized signal
(1)), the autocovariance function is defined as

$$ r_y(k) = \frac{1}{n-|k|} \sum_{j=1}^{n-|k|} (y_{j+|k|} - \bar{y})(y_j - \bar{y}), \qquad (2) $$

where $\bar{y}$ is the sample mean of $y$, and the autocorrelation function is defined by

$$ \rho_y(k) = \frac{r_y(k)}{r_y(0)}, \qquad (3) $$

defined for all $n$ and lag $k$.

A variation of autocorrelation analysis for measuring the periodicity of voiced speech uses the average 
magnitude difference function (AMDF), defined by the relation 

$$ D_y(k) = \frac{1}{n-|k|} \sum_{j=1}^{n-|k|} |y_{j+k} - y_j|. $$

In the case of AMDF, $D_y(k)$ has been approximated by a scalar multiple of

$$ \left[ \frac{1}{n-|k|} \sum_{j=1}^{n-|k|} (y_{j+k} - y_j)^2 \right]^{1/2}, $$

which in turn has been approximated by a scalar multiple of $[2\{r_y(0) - r_y(k)\}]^{1/2}$, where $r_y(k)$ is the
autocovariance defined above. These approximations may suppress calculations important for pitch
extraction.
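The three lag functions defined above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names are ours, and the normalisation by the number of overlapping pairs, $n - |k|$, is an assumption about the definitions rather than something the text fixes beyond doubt.

```python
import numpy as np

def autocovariance(y, k):
    """Sample autocovariance at lag k, averaged over the n - |k| overlapping pairs (assumed)."""
    y = np.asarray(y, dtype=float)
    k = abs(int(k))
    n = y.size
    d = y - y.mean()                      # deviations from the sample mean
    return float(np.dot(d[k:], d[:n - k]) / (n - k))

def autocorrelation(y, k):
    """Autocovariance at lag k divided by the lag-0 autocovariance."""
    return autocovariance(y, k) / autocovariance(y, 0)

def amdf(y, k):
    """Average magnitude difference between the signal and its copy shifted by k."""
    y = np.asarray(y, dtype=float)
    k = abs(int(k))
    n = y.size
    return float(np.mean(np.abs(y[k:] - y[:n - k])))

# Hypothetical test signal: four periods of a sinusoid with a period of 110 samples.
t = np.arange(440)
y = np.sin(2 * np.pi * t / 110)
print(autocorrelation(y, 110), amdf(y, 110))   # peak of the autocorrelation, null of the AMDF
print(autocorrelation(y, 55), amdf(y, 55))     # half a period: strongly negative, large
```

At a lag equal to the period the autocorrelation peaks while the AMDF dips, which is the behaviour both classical methods exploit.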

AMDF can also be regarded as an analogue of Gini's mean difference ([Goon et al., 7], pages
233-234), whereas the new method described below is an analogue of the variance of the data set and,
in our experiments, gives better results. We therefore propose a new function, whose link to the
autocorrelation is worked out in the Appendix.

Consider the voiced segment $y = (y_1, y_2, \ldots, y_n)$ in a digital speech signal. Since most speech signals
can be viewed as a quasi-periodic sequence, the fundamental frequency may not be uniquely defined
mathematically. In our approach we estimate the fundamental frequency by statistically enhancing the
most significant harmonics present in $y$.

We describe below our algorithm for estimating the pitch in this voiced segment. For $1 \le i \le n$
and $k \ge 1$ we consider the downsampled subsets (windows) of the original signal,

$$ y_{i,k} = (y_{i+pk} : p = 0, \pm 1, \pm 2, \ldots). $$

Note that, because the data stream is finite, for each $(i, k)$ pair we consider only those values
of $p$ for which $y_{i+pk}$ is within the range. Furthermore, for several $(i, k)$ pairs $y_{i,k}$ becomes a
singleton, and these pairs are excluded from further consideration. It is worth bringing in the issue of
aliasing here. The parameter $k$ is treated as a trial wavelength, so that smaller values of $k$ relate to higher
frequencies and vice versa. We restrict the lower limit of this parameter and assume $k \ge k_0$ as an adjustment
for the Nyquist rate based on the original sampling rate of the signal. However, as our goal is to estimate
the fundamental frequency of the voiced part, we shall generally be interested in the higher range of values
of $k$, which again raises the issue of aliasing. Based on the sampling rate of the signal, an upper bound for
$k$, namely $k_{\max}$, needs to be set so that the overall sampling rate after downsampling stays above 40 kHz.
One of the basic assumptions is that the fundamental frequency of the voiced part is estimable (in terms
of both Nyquist rate and aliasing) for the given signal.
Next, for each $k \ge k_0$, define

$$ S_k = \{ i : 1 \le i \le n;\ y_{i,k} \text{ has at least two elements} \}, $$

and let $q_k$ be the number of elements in $S_k$. This automatically sets an upper bound $k \le k_{\max} \le
[(n+1)/2] - 1$ ($[x]$ being the greatest integer less than $x$). Finally define, for $k$ values with $q_k > 0$,

$$ g(k) = \frac{1}{q_k} \sum_{i \in S_k} \mathrm{Var}(y_{i,k}). \qquad (4) $$
range of k) of downsampled signals and robustness of the method under noisy environments considered
are largely related issues.
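The definition of the ASMDF statistic g(k) can be sketched as follows. This is a minimal illustration under two assumptions of ours: `Var` is taken to be the unbiased sample variance, and, since every index i in the same residue class modulo k yields the same subset, averaging over the index set amounts to a size-weighted average over residue classes. The function name `asmdf` is hypothetical.

```python
import numpy as np

def asmdf(y, k):
    """g(k): average over all admissible indices i of the variance of the subset through i."""
    y = np.asarray(y, dtype=float)
    total, q_k = 0.0, 0
    for r in range(k):                    # one residue class modulo k at a time
        subset = y[r::k]                  # samples r, r + k, r + 2k, ... (0-based)
        m = subset.size
        if m >= 2:                        # singletons (and empty classes) are excluded
            total += m * np.var(subset, ddof=1)   # each of the m indices contributes this variance
            q_k += m
    if q_k == 0:
        raise ValueError("g(k) is undefined: no subset has at least two elements")
    return float(total / q_k)

# Hypothetical example: a sequence with exact period 3.
y = np.tile([1.0, 2.0, 3.0], 4)
print(asmdf(y, 3))   # subsets are constant, so the averaged variance vanishes
print(asmdf(y, 1))   # k = 1 gives the sample variance of the whole sequence
```

When the trial lag equals the true period every subset is constant, so g(k) drops to zero; this is the dip that the method searches for.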

Let $f_s$ be the sample rate of the original speech signal and $f(k) = f_s / k$, where $k_0 \le k \le k_{\max}$. In
view of (4), $g$ can be thought of as a function of $f$. Also, $g$ can be thought of as a mean squared mutual
difference function, which can be approximated using the standard autocorrelation function.

Let $i$ be the index of the second minimum of the components of $g(k)$, i.e. $g(i) = \min_k g(k)$. Then
$f_p = f(i)$ is referred to as the estimated fundamental frequency of the speech signal $y$.
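The estimation step can be illustrated with a short sketch. Several choices here are assumptions of ours, not taken from the text: the trial frequency for lag k is the sample rate divided by k, the lag range is derived from a hypothetical pitch search band of 80-350 Hz, the variance is the unbiased sample variance, and the estimate is the global minimiser of g over that restricted range.

```python
import numpy as np

def asmdf(y, k):
    """g(k): size-weighted average of the sample variances of the subsets y[r::k]."""
    subsets = [y[r::k] for r in range(k)]
    subsets = [s for s in subsets if s.size >= 2]     # drop empty and singleton subsets
    q_k = sum(s.size for s in subsets)
    return sum(s.size * s.var(ddof=1) for s in subsets) / q_k

def estimate_pitch(y, fs, f_lo=80.0, f_hi=350.0):
    """Return (pitch estimate in Hz, minimising lag) for one voiced frame."""
    y = np.asarray(y, dtype=float)
    k0 = int(np.ceil(fs / f_hi))                          # smallest trial lag
    k_max = min(int(np.floor(fs / f_lo)), (y.size + 1) // 2 - 1)
    ks = np.arange(k0, k_max + 1)
    g = np.array([asmdf(y, int(k)) for k in ks])
    k_hat = int(ks[np.argmin(g)])                         # lag with the smallest g
    return fs / k_hat, k_hat

# Hypothetical frame: 0.1 s of a 100 Hz sinusoid sampled at 11000 Hz.
fs = 11000
n = np.arange(1100)
y = np.sin(2 * np.pi * 100.0 * n / fs)
f_p, k_hat = estimate_pitch(y, fs)
print(k_hat, f_p)
```

Restricting the lag range to less than two periods keeps multiples of the true period, which also give near-zero g, out of the search.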

If we further assume that the white noise sequence $\{z_j\}$ in (1) is Gaussian and carry out the likelihood
ratio test of the null hypothesis $H_0 : E(y_{i,k}) = \mu_i (1, 1, \ldots, 1)$, the ASMDF statistic $g(k)$ coincides with a
measure of departure from this null hypothesis. This is the motivation behind the main proposal of the
present paper.

III. EXPERIMENTS 

A. Synthetic Data 

Experiment No. 1. We have taken the function $y_n = \sin(x_n)$, $n = 0, 1, 2, \ldots, 11000$, where $x_n$ is
proportional to $n$. We have then plotted $g(k)$ for the 100th window for different $k$, and also plotted
$r_y(k)$ (autocorrelation values) and $D_y(k)$ (AMDF values) against $k$ in the same graph.




Figure 1: Graph of g(k) against k for y = sin(x)

Figure 2: Graph of g(k), $r_y(k)$ (dashed) and $D_y(k)$ (dotted) for y = sin(x)

Figure 3: Graph of pitch for y = sin(x)



Here we see that peaks in the autocorrelation graph and dips in the AMDF and ASMDF graphs occur at
the same $k$. Next we present the graph of the fundamental frequency, which is expected to be 100 Hz, for the
same data with a sample rate of 11000 Hz, a window size of 400 samples and a window shift of 55
consecutive samples, using ASMDF. Autocorrelation and AMDF give identical graphs.
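A sliding-window tracker with the stated window size (400 samples) and shift (55 samples) might look like the sketch below. It is not the authors' code: the test tone is assumed to be a 100 Hz sinusoid at 11000 Hz (matching the expected pitch quoted above), the search band of 80-350 Hz and the use of the unbiased sample variance are our assumptions, and the per-window estimate is the global minimiser of g over the restricted lag range. The variance step is vectorised by padding each window with NaN to a whole number of rows, which computes the same size-weighted average of subset variances.

```python
import numpy as np

def asmdf(y, k):
    """g(k) for one window, computed column-wise after NaN padding."""
    n = y.size
    rows = -(-n // k)                              # ceil(n / k)
    padded = np.full(rows * k, np.nan)
    padded[:n] = y
    cols = padded.reshape(rows, k)                 # column r holds y[r::k]
    counts = np.sum(~np.isnan(cols), axis=0)       # subset sizes
    variances = np.nanvar(cols, axis=0, ddof=1)    # unbiased variance of each subset
    keep = counts >= 2                             # exclude singleton subsets
    return float(np.sum(counts[keep] * variances[keep]) / np.sum(counts[keep]))

def track_pitch(y, fs, window=400, shift=55, f_lo=80.0, f_hi=350.0):
    """Pitch estimate (Hz) for each window position along the signal."""
    y = np.asarray(y, dtype=float)
    k0 = int(np.ceil(fs / f_hi))
    k_max = min(int(np.floor(fs / f_lo)), (window + 1) // 2 - 1)
    ks = np.arange(k0, k_max + 1)
    track = []
    for start in range(0, y.size - window + 1, shift):
        frame = y[start:start + window]
        g = [asmdf(frame, int(k)) for k in ks]
        track.append(fs / ks[int(np.argmin(g))])
    return np.array(track)

fs = 11000
n = np.arange(11001)
y = np.sin(2 * np.pi * 100.0 * n / fs)             # assumed 100 Hz test tone
track = track_pitch(y, fs)
print(track.size, track.min(), track.max())
```

For a stationary tone the contour is flat; on speech the same loop would produce one estimate per window shift.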



Experiment No. 2. We have taken $a_n = \sin(b_n)$ and $c_n = \cos(d_n)$ for $n = 0, 1, 2, \ldots, 5500$,
where $b_n$ and $d_n$ are proportional to $n$. Then $y$ has been taken as the linear combination
$y = 0.47 a_n + 0.59 c_n$. Now we plot $g(k)$ for the 100th window for different $k$ and also plot $r_y(k)$
(autocorrelation values) and $D_y(k)$ (AMDF values) against $k$ in the same graph.
Figure 4: Graph of g(k) against k for $y = 0.47 a_n + 0.59 c_n$

Figure 5: Graph of g(k), $r_y(k)$ (dashed) and $D_y(k)$ (dotted) against k for $y = 0.47 a_n + 0.59 c_n$



Here we see that the peaks in the autocorrelation graph and the dips in the AMDF graph tally with each
other, but the dips in ASMDF sometimes match these two and sometimes fall just short of them. The reason
is that the peaks and the dips for all three methods are supposed to appear at indices which are
(approximately) multiples of the smallest index. Here the first peak and dip are at the 22nd index for all
three methods. The second peak for autocorrelation appears at the 46th index, and the second dip for
AMDF appears there too. For ASMDF, however, the second dip appears at the 45th index, which is a better
approximation for finding the pitch. The graph of the fundamental frequency, which is expected to be 250 Hz
(5500 / {2 × gcd([55], [56 × 2/5])} = 250) for the same data with a sample rate of 5500 Hz, a window size of
400 samples and a window shift of 55 consecutive samples, is obtained as follows.
Figure 6: Graph of pitch for AMDF for $y = 0.47 a_n + 0.59 c_n$

Figure 7: Graph of pitch for autocorrelation and ASMDF for $y = 0.47 a_n + 0.59 c_n$ (since they are identical)



B. Speech Data 

Experiment No. 3. Clean speech signals with unknown true pitch values were obtained from the IViE
Corpus [1]. The speech samples were uttered by one female speaker (F1) and one male speaker (M1). Each
speech signal consisted of at most five English words, and the signals were sampled at different rates.
Taking a window size of 1000 samples, we obtained the pitch data sets and the following graphs.

To investigate the accuracy of ASMDF, we have conducted experiments which compare it with
two conventional methods, autocorrelation and AMDF.

The autocorrelation method performs the correlation analysis frame by frame to obtain the estimated
average pitch period of the speaker. The property of this function is that $r_y(k)$ is large when $y_n$ has a
similar value to $y_{n+k}$. If $y_n$ has pitch period $i$, then $r_y(k)$ has peaks at the integral multiples of $i$.
Obviously $r_y(0)$ is the maximum among these values, the second largest being $r_y(i)$. Other maxima usually
decrease as $k$ increases. Therefore, using this method we can estimate $i$ from the location of the peak at $k = i$.

A variation of autocorrelation analysis for measuring the periodicity of voiced speech uses the AMDF.
The separation of the nulls that appear in calculating $D_y(k)$ is a direct measure of the pitch period.
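The two conventional estimators described above can be sketched as a peak search on the autocovariance and a null search on the AMDF. This is an illustrative sketch under our own assumptions: the lag range comes from a hypothetical 80-350 Hz search band, lag 0 is excluded by construction, and both lag functions are normalised by the number of overlapping pairs.

```python
import numpy as np

def lag_range(fs, f_lo=80.0, f_hi=350.0):
    """Trial lags (in samples) corresponding to the assumed pitch search band."""
    return np.arange(int(np.ceil(fs / f_hi)), int(np.floor(fs / f_lo)) + 1)

def pitch_by_autocorrelation(y, fs):
    """Pitch from the lag of the largest autocovariance peak away from lag 0."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    n = y.size
    ks = lag_range(fs)
    r = [np.dot(d[k:], d[:n - k]) / (n - k) for k in ks]
    return float(fs / ks[int(np.argmax(r))])

def pitch_by_amdf(y, fs):
    """Pitch from the lag of the deepest AMDF null."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ks = lag_range(fs)
    D = [np.mean(np.abs(y[k:] - y[:n - k])) for k in ks]
    return float(fs / ks[int(np.argmin(D))])

# Hypothetical frame: 0.1 s of a 100 Hz sinusoid sampled at 11000 Hz.
fs = 11000
n = np.arange(1100)
y = np.sin(2 * np.pi * 100.0 * n / fs)
print(pitch_by_autocorrelation(y, fs), pitch_by_amdf(y, fs))
```

On a clean periodic frame both searches land on the same lag; the experiments below concern how they behave on real speech, where the peak or null can jump between multiples of the period.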



Figure 8 (a): Graph for pitch of speaker F1 using ASMDF

Figure 8 (b): Graph for pitch of speaker F1 using AMDF

Figure 9 (a): Graph for pitch of speaker M1 using ASMDF

Figure 9 (b): Graph for pitch of speaker M1 using AMDF
Observations: F1: Major fluctuations are observed around the 280th, 470th and 920th ms for the
autocorrelation and AMDF methods, whereas the ASMDF method shows only minor fluctuations in the
segment.

M1: Here the ASMDF method is largely consistent with the autocorrelation and AMDF methods.
Around the 180th and 230th ms there are major fluctuations for the autocorrelation and AMDF methods but
not for ASMDF.



Experiment No. 4. One male (RL) and one female (SB) speaker each spoke 50 sentences, out of
which fifteen speeches (with known pitch or laryngeal frequency contour in XMG format) were taken
from the FDA Evaluation Database [2] for the experiment. Taking a window size of 400 samples, we obtained
the pitch data sets and the following graphs for the first (001) speeches (the sample rate of each
Laryngograph waveform is given as 20000 Hz). All other graphs were observed to behave similarly. All
the graphs have time (in ms) on the horizontal axis and pitch (in Hz) on the vertical axis.

Graphs of pitch of speaker RL speaking speech 001: 
Fig. 10 (a), (b): Graphs of true and ASMDF pitch values respectively.

Fig. 10 (c), (d): Graphs of Autocorrelation and AMDF pitch values respectively.

Graphs of pitch of speaker SB speaking speech 001:

Fig. 11 (a), (b): Graphs of true and ASMDF pitch values respectively.

Fig. 11 (c), (d): Graphs of Autocorrelation and AMDF pitch values respectively.



C. Synthetic and Noisy Data (without true pitch values) 

Experiment No. 5. Tones of seven notes, each approximately 1.6 seconds long, were played on
a Casio SA21 synthesizer in its harmonica mode and recorded on an Acer Travelmate 240
laptop using Audacity 1.2.3 at a sample rate of 44100 Hz. The following pitch graphs (with the pitch axis
against the k axis) of the note "Do" were obtained using the three methods. The graphs for the remaining
notes showed similar behaviour.

Fig. 12 (a), (b), (c): Graphs of pitch for Do in Casio SA21 (Harmonica) by ASMDF, AMDF and Autocorrelation respectively.



From the graphs we see that noise and various encoding-decoding mismatches of the signals cause
distortions, but the main cause is that the harmonica mode of the synthesizer, being a well-known
multiphonic device, has no fixed pitch value. Here too, ASMDF is more robust against mismatches
and noise than autocorrelation and AMDF, which supports our claim.

IV. ERROR ANALYSIS (of exp. no. 4) 

Let $P_c$, $P_m$ and $P_s$ be the pitch contours found using autocorrelation, AMDF and ASMDF respectively,
calculated over the same window for the above three speeches of the two speakers. Let $P$ be the true
pitch value over the same window. Let us denote

$$ e_s = (P - P_s)/P, \quad e_c = (P - P_c)/P \quad \text{and} \quad e_m = (P - P_m)/P $$

as the relative pitch errors (or percentage gross errors). Let us define the standard deviation of the relative
pitch error as




where $e(i)$ stands for $e_s(i)$, $e_m(i)$ or $e_c(i)$, $L_e$ is the length of each relative pitch error sequence,
and $\bar{e}$ is the mean pitch error. The experimental values of $\sigma_e$, viz. $\sigma_e(c)$, $\sigma_e(m)$ and $\sigma_e(s)$ for
autocorrelation, AMDF and ASMDF respectively, are given in the following table:
Here, "$\sigma_e(s)$ for speech RL001 is 0.0248" means that the standard deviation of the gross pitch error for speech
RL001 by ASMDF is 2.48% with respect to the true pitch value. If the standard deviation of the gross pitch
error is more than 20% (threshold) with respect to the true value, its consistency is questionable.
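The error measure can be sketched as follows. The pitch values below are made up purely for illustration, the function names are ours, and dividing by $L_e$ (rather than $L_e - 1$) inside the standard deviation is an assumption.

```python
import numpy as np

def relative_pitch_error(p_true, p_est):
    """Elementwise relative error (P - P_est) / P between two pitch contours."""
    p_true = np.asarray(p_true, dtype=float)
    p_est = np.asarray(p_est, dtype=float)
    return (p_true - p_est) / p_true

def error_std(e):
    """Standard deviation of the relative error about its mean (divisor L_e assumed)."""
    e = np.asarray(e, dtype=float)
    return float(np.sqrt(np.mean((e - e.mean()) ** 2)))

# Hypothetical contours (Hz), not data from the experiments.
p_true = [100.0, 200.0]
p_est = [90.0, 210.0]
e = relative_pitch_error(p_true, p_est)   # relative errors 0.1 and -0.05
sigma_e = error_std(e)
print(sigma_e, sigma_e > 0.20)            # compare against the 20% threshold
```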

The correlation coefficients of $(P, P_s)$, $(P, P_c)$ and $(P, P_m)$, denoted $r_s$, $r_c$ and $r_m$ respectively,
are given in the following correlation chart:
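Such a correlation coefficient between a true and an estimated pitch contour can be computed with NumPy's `corrcoef`. The contours below are invented for illustration, and the helper name `contour_correlation` is ours.

```python
import numpy as np

def contour_correlation(p_true, p_est):
    """Pearson correlation coefficient between two pitch contours of equal length."""
    return float(np.corrcoef(p_true, p_est)[0, 1])

# Hypothetical contours (Hz), not data from the experiments.
p_true = np.array([110.0, 118.0, 125.0, 121.0, 115.0])
p_est = p_true + np.array([1.0, -2.0, 0.5, 1.5, -1.0])
r = contour_correlation(p_true, p_est)
print(r)
```

A value near 1 indicates that the estimated contour follows the shape of the true contour, independently of any constant offset or scale.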
V. CONCLUSION

Based on the experimental results and error calculations, ASMDF has been shown to be useful in a
clean environment. It also refines the autocorrelation-based methods to a great extent.
The idea behind this method also leads to methods for extracting other features of speech signals. There
is also scope for using cutting and smoothing technique(s) in the evaluation of pitch. The limitation of
this method lies in the procedure for obtaining the minimum values from the data set, which often
fall outside the known limits, viz. 80-200 Hz for males and 150-350 Hz for females.

VI. APPENDIX 

A. RELATIONSHIP BETWEEN ASMDF AND THE AUTOCORRELATION FUNCTION 
Note that $g(k)$ can alternatively be defined as follows. Define the set

$$ C_k = \{ (i,j) : 1 \le i, j \le n,\ i \ne j,\ |i - j| \text{ is divisible by } k \}, $$

for $k_0 \le k \le k_{\max}$, and let $|C_k|$ denote the cardinality of $C_k$. Then

$$ g(k) = \frac{1}{2|C_k|} \sum_{(i,j) \in C_k} (y_i - y_j)^2, $$

provided $C_k$ is non-empty. Otherwise $g$ is not defined. This follows from standard algebra with the
formula for variance. Then it can be shown that



I p=i J (i,i)ec k 

for large $n$ ($r_y$ and $\rho_y$ have already been defined in (2) and (3)). The bias term (second term) is what is
minimized in ASMDF based on $g(k)$. Note that the first term arises from the noisy environment and
is constant (equal to $r_y(0)$) for white noise error. Otherwise, even for well-behaved stationary errors, it is
a slowly varying function, which ensures robustness of our procedure under (1).
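The two facts behind this relationship can be written out as a short worked step. This is a hedged sketch rather than the derivation of the paper: it treats $y$ as a wide-sense stationary sequence with constant mean, and $\gamma(h)$ (a symbol introduced here) denotes its population autocovariance at lag $h$.

```latex
% (a) Pairwise form of the sample variance of m values x_1, ..., x_m:
\frac{1}{2m(m-1)} \sum_{a \neq b} (x_a - x_b)^2
  = \frac{1}{m-1} \sum_{a=1}^{m} (x_a - \bar{x})^2 .

% (b) Expected squared difference under stationarity with constant mean:
E\,(y_i - y_j)^2 = 2\,\{\gamma(0) - \gamma(|i-j|)\} .

% (c) Averaging (b) over the pairs whose index difference is divisible by k:
E\!\left[ \frac{1}{2|C_k|} \sum_{(i,j) \in C_k} (y_i - y_j)^2 \right]
  = \gamma(0) - \frac{1}{|C_k|} \sum_{(i,j) \in C_k} \gamma(|i-j|) .
```

Step (a) is why an average of subset variances can be rewritten as an average of squared pairwise differences, and step (c) shows that, under these assumptions, minimizing the pairwise average over $k$ corresponds to maximizing the autocovariance averaged over lags that are multiples of $k$.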

B. COMPUTATIONAL COMPLEXITY 

The computational complexity of ASMDF (roughly $n(4n+5.5)$ operations), which is of order $n^2$,
is a bit higher than that of the autocorrelation function (viz. $2n(n-1)$) and AMDF (viz. $(n-1)(3n-1)$), which
are also of order $n^2$, $n$ being the length of a window.